-
Notifications
You must be signed in to change notification settings - Fork 3.5k
[Dev] Fix Linear-Cross-Entropy Convergence Issue #2739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
|
/ok to test 3cae089 |
…ar-Cross-Entropy Fusion with existing checkpoints
2. Fix LinearCrossEntropyModule 3. Code format
Jianbing-D
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kernel modifications looks good to me
|
/ok to test 483c1a8 |
|
/ok to test 22f770f |
|
/ok to test 0217dd6 |
|
Hi @NVIDIA/core-devtech @NVIDIA/core-nemo , could you help review this PR? Jianbing has reviewed and approved the changes to the kernel. |
lhb8125
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
@ISEEKYAN Could you also review this PR so that it meets the requirement for RL?
ISEEKYAN
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is OK for RL worker now. We can set reduction=None to get log_probs which is the necessary output for RL. It is better to enable this kernel with ability of giving entropy along with log_prob, we can do this in a new PR. cc @HaochenYuan
|
/ok to test 5fd23e6 |
|
/ok to test 2d2c30c |
|
Hi Jack, could you please add the link to the main PR in the desc? Thanks. |
What does this PR do ?
main branch: #3076
NOTE: The kernel will be migrated to TE in future updates, and the main branch will depend on this version of the kernel.
1. Fix: incorrect parameter reference in Linear-Cross-Entropy Fusion for non-shared embeddings
When using non-shared embeddings,
Linear-Cross-Entropy Fusionincorrectly references the shared embedding weight. In this scenario, the return value ofshared_embedding_or_output_weightshould not be used. Instead, it should explicitly referenceoutput_layer.weight.Additionally, since
Linear-Cross-Entropy Fusiondoes not invoke theoutput_layermodule, m-core DDP’s parameter gathering check may fail. To address this, I subclassedoutput_layer(i.e.,ColumnParallelLinear) and moved the Linear-Cross-Entropy logic into itsforwardmethod.2. Improvement: Use FP32 accumulation in multi-chunk computation for better convergence
The previous implementation of Linear-Cross-Entropy Fusion split computation along the vocabulary dimension and accumulated gradients in BF16, using the operation:
This caused
d_hiddento accumulate in BF16, which could lead to reduced numerical accuracy.This PR fixes the issue by allocating an FP32 buffer for
d_hiddenduring accumulation, then downcasting it back to BF16 after accumulation completes. This improves numerical stability and overall convergence.Contribution process
flowchart LR A[Pre-checks] --> B[PR Tests] subgraph Code Review/Approval C1[Expert Review] --> C2[Final Review] end B --> C1 C2 --> D[Merge]Pre-checks
Core 0.8)Code review
The following process is enforced via the CODEOWNERS file for changes into
megatron/core. For changes outside ofmegatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.For MRs into `main` branch
(Step 1): Add PR label
Expert Review(Step 2): Collect the expert reviewers reviews
Expert Reviewlabel when your PR is ready for review.Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Final Reviewlabel(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into
core_r*release branches, after this PR has been merged, selectCherry-pickto open a new PR into the release branch.For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
[email protected]or[email protected].Merging your PR
Any member of core-adlr and
core-nemowill be able to merge your PR.